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When a model may be fitted separately to each individual statis- 
tical unit, inspection of the point estimates may help the statistician 
to understand between-individual variability and to identify possi- 
ble relationships. However, some information will be lost in such an 
approach because estimation uncertainty is disregarded. We present 
a comparative method for exploratory repeated-measures analysis to 
complement the point estimates that was motivated by and is demon- 
strated by analysis of data from the CADET II breast-cancer screen- 
ing study. The approach helped to flag up some unusual reader be- 
havior, to assess differences in performance, and to identify potential 
random-effects models for further analysis. 

1. Introduction. In this article we propose an approach for exploratory 
repeated-measures analysis. The term repeated measures is used in a loose 
sense to mean that more than one datum is recorded on each individual 
unit. However, the measurements themselves will be permitted to have any 
data structure with a likelihood function, perhaps ranging from replicated 
readings of the same quantity to multivariate measurements of a stochastic 
process through time. The exploratory method was motivated and is applied 
to data from the computer aided detection evaluation trial (CADET) II trial, 
where 27 human readers inspected distinct mammograms (breast x-rays) for 
cancer screening. Our analysis aim is to determine whether real differences in 
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behavior exist between the individual readers, including whether any might 
be outliers, and then if heterogeneity is observed, to seek possible groups 
of similar individuals, and factors that correlate with the differences. The 
proposal is partly motivated by the difficulty of such an objective when 
the sample size is 27, even when up to several thousand measurements are 
observed on each reader. The approach is developed in the next section 
and then it is demonstrated using the data. Conclusions follow a section 
discussing the application of the method to other data sets. 

2. Method. The general data structure is first described and the main 
similarity-matrix idea is defined. Then, some properties of the matrix are 
recorded and we comment on some ways in which it may be used for ex- 
ploratory repeated- measures analysis. 

2.1. Setup. Suppose there are n individual units {i = l,...,n) with rii 
repeated measurements = (yj^i, . . . , observed. The application in 
this paper has the units as humans who interpret mammograms for cancer 
screening. We assume that there is a suitable model form for the probabil- 
ity mass or density function p(y|uj) parametrized by 

where m is the dimension of each Uj. For example, if the results are binary 
indicators for recall [y = 1) or no action (y = 0), then p{y\ui) might be a bi- 
nomial model (m = 1) with parameter Ui interpreted as i's probability of 
recall. More generally, p(y|uj) could be developed from a data analysis, or 
knowledge of the problem, but we assume that the Uj occur in the same form 
for each individual p(yj|uj). The statistical modeling goal taken here is to un- 
derstand variability of the Uj's, perhaps through a model for p(u), or p(u|x) 
with explanatory variables Xj = (xji, . . . ,Xir). This two-stage model struc- 
ture taken is also taken in other areas, such as in applications using linear 
mixed models [Crowder and Hand (1990)]. 

Note that the setup considered is different to generalized estimating equa- 
tions (GEE). These are used to estimate marginal (population-averaged) 
regression coefficients /3 in a repeated measures context where E(yj) = 
/i(xj;/3), but without assuming a full probability model for or even 
a "true" covariance structure for y^. In this paper we have a full (condi- 
tional) probability model for yj, p(yj|uj), that is based on subject-specific 
parameters (random effects) and the focus is upon their distribution over 
the population. 

2.2. The similarity matrix. The exploratory measure that we call a sim- 
ilarity matrix is obtained in two steps: 

(1) Compute consistent Uj for i = 1, . . . ,n, such as maximum likelihood 
estimates of u^; then 
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(2) calculate the z-matrix with row i = 1, . . . ,n and column j = 1, . . . , n 
entries from 

(1) z - ^(^^"^^-^ 



Efc=iP(yi|ufc)' 

A likelihood function p(data|0) reveals the relative plausibilities of different 
parameter 0- values in the light of the data. Here Zij cxp(yi|uj) does likewise 
for the Uj (j = 1, . . . ,n) in light of the data y^. The Zij quantity thus ex- 
plores the similarity of the Uj's via their estimates, by measuring how close 
individual j's parameter fit is to individual i's data. 

If Uii is a sequence of / = 1, . . . , nj binary indicators as above, and a bi- 
nomial likelihood is assumed for p{yi\ui), then using the notation = 
Er=i2/i/' have 

(2) z,-- ' " 



because cancels in the numerator and denominator. 
2.3. Some properties of the matrix. 

(1) 0<2ij <1. 

(2) Zij = 0(l/n). The practical significance is that larger matrices will 
have smaller Zij terms. 

(3) Zi+ = I]j=i ^ij = 1- 

(4) Zij < Zii for j / i if maximum-likelihood estimation is used [because 
p(yi|u) <p(yi|ui) for all u]. 

(5) The matrix follows from Bayes' rule 

Pe{u\yi) (xpiyi\u)pe{u), 

where pe (u) is a probability mass function that approximates variation of the 
random effect u across individuals p{u) by assigning mass 1/n to each of the 
points (ui, . . . ,Un)- The e subscript is used in the notation to make explicit 
the reference to this empirical distribution. That is, Zij = Pe{iii = Uj\yi), 
and the Zij quantities are posterior Uj mass values where the u distribution 
has been restricted to the points in Pe(u). 

(6) z is not symmetric unless Pe{iij = Ui\yj) equals Pe(ui = Uj|yi). 
Thus, it is not a similarity matrix in the usual sense. 

(7) When Zij = za, then y^ is equally well conditioned on Uj and Uj, and 

Pe(uj = Ujiyi) = Pe{ui = Ui\yi). 

(8) z+j > 1 means that Uj is very likely the value for many i and/or Zjj 
is relatively large. 

(9) An alternative measure z+j/zjj can be used to assess the importance 
of Uj over the i^ j. 
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(10) A measure of the overall concentration of the estimates is trace(z)/ 
n E (0, 1). Since = n, trace(z)/n attains maximum value 1 when za = 1 
for all i. 

(11) {zii, Z22, ■ ■ ■ , Znn), or diag(z) provides a comparative measure of 
concentration in the estimates. This is because point estimates Uj with rela- 
tively high (or close to 1) za entries may be interpreted as good predictions 
since 

n 

Ee(Ui|yi) = '^UjPe{Ui = Uj\yi) 

i=i 

= ^ ZijUj. 

So for Zii close to one (and therefore Zij close to for j ^ i), a prediction 
from (3) is likely to be very close to Ui] for za not close to 1, the point- 
estimate Uj may be misleading because a prediction from (3) is subject to 
nonnegligible averaging (shrinkage). 

(12) If the Uj are distinct, then as nj ^ oo for each i the z-matrix will 
converge to the identity matrix because a consistent estimator of u is used. 
In practice, this means that when the Uj are different, then a data set with 
large iii and well-estimated Uj will have a z-matrix close to the identity 
matrix. Conversely, little structure is likely to be seen when all the iii are 
small, but it may still be worth applying the method to see if this is the 
case. The most useful case is likely to be when some of the nj are mode- 
rate. 

(13) A referee suggested a possible connection with Rubin's propensity 
score [Rosenbaum and Rubin (1983)]. In that setting individuals are matched 
(one the other a control) by a propensity score e(xj) = P{ci = l|xj), 
where c is an indicator of being a case. In our setup individuals are matched 
to each other through Zij = Pe{ui = Uj|yj). The propensity score reduces 
the dimension of multivariate matching on Xj to a univariate measure; the 
z-matrix transforms the dimensionality of matching individuals on Uj to 
a two-dimensional matrix. 

2.4. Why use the matrix for exploratory analysis? The first step in the 
computation of the z-matrix is to obtain point estimates (ui, . . . , u^)- These 
might be plotted in exploratory analysis to look for clusters, outliers and 
other structural relationships or trends across individuals in the data. For 
example, one can plot the parameter fits against each other, and against 
other covariates by using a matrix scatter plot. An example demonstrat- 
ing the use of this approach for exploratory analysis with hierarchical lin- 
ear models is Bowers and Drake (2005). One issue with the plots is that 
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uncertainty in the point estimates is disregarded and so apparent trends 
may be less impressive than first appears, or masked by samphng varia- 
tion. 

A first way that the above properties of the z-matrix can be used to add 
to the information in the plots is by helping to quantify the concentration 
of each individual's estimate Uj by inspection of diag(2;). A second way is by 
making comparisons between the estimates Uj and Uj from two individuals i 
and j, through the Zij and zji terms. An example of where these proper- 
ties are useful is when the Zij entries are zero, except for those within an 
identifiable cluster of u-values from the plots. This would suggest that the 
individuals form a fairly homogeneous group. A third way is to improve the 
point-estimates Uj {j = 1, . . . ,n) themselves, by using equation (3) to shrink 
the estimates through Ee(uj|yj). A fourth way is by using quantities such 
as z+j — Zjj or Zjj/z+j for j = 1, . . . ,n to show the more important Uj, or 
to identify outliers. Some techniques to draw attention to these and other 
features are next described. 

3. Exploratory analysis with the ^-matrix. In this section we propose 
a number of ways to present the z-matrix. They will be demonstrated using 
the breast-screening data later on. 

3.1. Tabular presentation of the matrix. When printing out the matrix 
it is important to display it in such a way that important aspects of the data 
are clearly visible. With this in mind we next suggest a way to display the 
matrix in tabular form: 

• Print out the transpose of z, not z. When making comparisons between 
individuals the main interest is comparing Zij for j = 1, . . . ,n. The trans- 
pose of the z-matrix is better because, as in tables, it is easier to compare 
down columns than across rows [LGDUW (2004)]. 

• Multiply the matrix by a power of 10 (e.g., 1,000) and do not display 
(multiplied) values less than 1. The point here is to focus the eye's at- 
tention on the difference between large, small and negligible proportions 
by using the number of digits displayed in a number. For example, the 
number 1,000 is seen to be larger than 10 because it has twice the number 
of digits; it is more difficult to see at ffist glance that 0.1000 is bigger 
than 0.0010 because they have the same number of digits. The choice of 
multiplication factor should depend on n since Zij = 0{l/n). 

• Experiment with the order of individuals. The order used might be based 
on an examination of one z-matrix, to regroup similar individuals, or it 
might be made using the covariate Xj data. A recommended first order is 
by one of the u components, or a function of interest using the u's. 
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3.2. Graphical presentation of the matrix. An alternative to printing the 
matrix is to use a plot. Since X]j=i Zij = 1, a recommended display is a his- 
togram variety, where there is one bar for each cell in the matrix. Such 
a chart can be produced using a symbols plot, with rectangles of area pro- 
portional to Zij. It is arguably easier to compare the shape of histograms 
down a page (one for each of the n units), so it might be better to leave the 
matrix untransposed in this instance. Use of the transpose for printing and 
the untransposed matrix for plotting might also help the statistician to see 
different features. 

3.3. Graphs to assess the number of groups. For scalar Uj (i.e., each com- 
ponent of Uj if a vector) order the individuals by Ui. Then a plot of (uj, i/n) 
provides the estimated distribution function of u, based on the a priori Pe{u). 
Such an approach uses information in the separate u's, but due to the equal 
weights, it might be improved by using the data to change the weights from 
1/n. The proposal is to use an estimate of the density of Uj from n~^z^j. 
If the Ui^s are all well estimated and different, then the weights will not 
change much from 1/n. If some are more likely over the sample than others, 
then they will be up-weighted, and others will be down-weighted. A re- 
lated quantity is the distribution function = 'T'"^ X^j=i -z+j. Plots of the 
z-matrix density and distribution function can be used to help assess the 
number of groups in the data. 

3.4. Shrinking parameter fits. A way to incorporate estimation uncer- 
tainty into any exploratory plots involving u is to use equation (3) to shrink 
the estimates through Ee(uj|yj). In this way, outliers might be more reliably 
identified, as well as possible patterns. 

3.5. Smoothing covariates. The matching of individuals through the z- 
matrix may be used to show the average covariates x for a given Uj. This 
might aid inspection of possible correlations beyond using the observed co- 
variates Xj recorded for each individual i = 1, . . . ,n in plots against (func- 
tions of) parameters Uj. We next show how Xj = '^'^^i^kZik/z+k can be 
derived as the expected x given Uj from the z-approach. 

Suppose we have data d, known to be one of the y^, but not which one, 
and have prior P(d = y^) = 1/n for i = 1, . . . , n. If we were interested in the 
probability that the data i = l,...,n were generated by parameter fit Uj, 
then we could use Pe(d = yi|u = Uj) = Zij/z+j. Now 

n 

(4) Pe(x|u) =^p(x|d = yfc,u)Pe(d = yfe|u). 

fc=i 

In the case where the Xj are distinct, we model P(x = Xjjd = y^, u) by 
an empirical distribution so that P(x = Xj|d = y^, u) = 1 if i = A; for k = 
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1, . . . ,n, and otherwise, then P(x = x^ju) = P(d = yfc|u), leading to 

n 

Ee(x|u = Ui) = y^XfcPe(d = yfc|u = VLj) 

k=l 
n 

k=l 

In the case where the Xj are not distinct, one can still use x^ as defined 
above. A crude way to think of the approach is that individuals are locally 
clustered depending on their u's, and the average covariate at that cluster 
is obtained. Thus, given u, the variation in the x's is much less than the 
original x's. It is hoped that the process will smooth out some sampling 
variation, making it easier to assess if there are any real patterns of interest 
between x and u. 

3.6. Graphical testing. A last exploratory approach is to follow Gelman 
(2004), Buja et al. (2009) and others by comparing ^-matrices or associated 
plots against null model simulations. 

4. Background to application. Two human readers are presently used in 
England to interpret mammograms (breast x-rays) from the breast-cancer 
screening program. This regimen is often called double reading, but we will 
call it dual reading to emphasize that two independent readers inspect each 
mammogram. If both readers find no abnormalities, the screenee is notified 
of the negative result and no further action is taken. If both readers find 
a suspicious abnormality, the screenee is recalled for further investigations. 
If the readers disagree, one common practice is to have a third reader ar- 
bitrate. Typically, for 1,000 women undergoing screening, around 42 might 
be recalled, of whom 8 are found to have cancer after further investigation 
[NHS Breast Screening Programme (2009)]. Several studies have shown that 
two readers can detect more cancers than a single reader [Taylor and Potts 
(2008)]. The computer aided detection evaluation trial (CADET) II was de- 
signed to assess whether a single reader using a computer-aided detection 
tool could match the performance of two readers. 

In the trial 31,057 mammograms were read at three centers in England, 
such that a ratio of 1 : 1 : 28 were, respectively, dual reading only; single- 
reading with CAD (computer-aided detection) only; and both dual reading 
and single reading with CAD. Most of the screens were therefore matched 
pairs from dual reading and single reading with CAD. The reason why some 
screens were only read by one of the regimens was to reduce the possibility 
of bias from readers changing their behavior due to the knowledge that 
a further reading of the case would take place. Only the 28,204 matched- 
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pair cases are considered from now on. The main detection result was that 
199 out of the 227 cancers detected were recahed by dual reading, and 198 by 
single reading with CAD. 170 of the cases were detected by both, so the single 
readers with CAD detected 28 cases missed by dual reading; dual reading 
detected 29 cases missed by the single reader with CAD (and 170 + 28 + 29 = 
227). The overall recall rate for dual reading was 3.4% and for single reading 
with CAD it was slightly higher at 3.9%. The analysis of the trial in Gilbert 
et al. (2008) found that single reading with computer-aided detection could 
be an alternative to dual reading. 

The primary analysis published in Gilbert et al. (2008) addressed the 
question of whether detection and recall rates differ between dual reading 
versus single reading with CAD. Further questions may be posed of the data 
to help improve best practice in other areas: if factors can be identified that 
predict outcomes prior to the screen being read, then steps might be taken 
to mitigate risks. The aim of the analysis in this article is to assess whether 
individual readers behaved differently, and to determine if any factors might 
influence whether a reader missed more cancers, or recalled more often than 
others. In the data available from the trial we had information on their 
training (radiologist, radiographer, other) and the number of years they had 
read mammograms prior to the trial, and we explore whether any differences 
between them might be related to these two factors. Although there are 
a large number of screens, the total number of readers involved in the trial 
was 27, and so drawing inference is more difficult than might appear from 
consideration of the large number of 28,204 cases. 

5. Reader recall and detection rates. In this section we use data from 
CADET II to demonstrate the z-matrix exploratory analysis as a precursor 
to model building. The aim of the analysis is to explore the data to assess 
if and why some readers performed differently to others. 

5.1. Data. We present two exploratory analyses, one for the first reader 
in a dual-reader pair, the second for a single reader with CAD. In the case 
of a first reader i from a dual reading the response is detection of cancer: 
y = 1 when a cancer is detected, otherwise. In the case of a single reader i 
with CAD the response is recall: y = 1 if a case is recalled, if not. There 
are A: = 1, . . . , rij screens by individual i, and we take 

p{y,\ui)cxuf+{l-Uir^-y^+, 

thus assuming that the Uik are conditionally independent with P{yik = 1) = 
Ui. These data are shown in Tables 1 and 2. The total number of readers 
in both regimens differs partly due to not all of them being trained to used 
CAD. Further details about the data are in Gilbert et al. (2008). 
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Table 1 

Number of cases detected and recalled by first 26 dual readers. Analysis is undertaken 

for detection rate Ui 





Cancers 


Screens 


MLE (%) 


Concentration 


Center 


Vi+ 


Recalls 


rii 


Ui = yij^/m 


{zii X 1,000) 


2 


2 


10 


18 


11.1 


290 


2 


8 


26 


92 


8.7 


375 


2 


4 


19 


53 


7.5 


363 


3 


5 


11 


355 


1.4 


108 


1 


5 


16 


394 


1.3 


92 


3 


9 


27 


805 


1.1 


103 


2 


11 


36 


1,022 


1.1 


109 


1 


15 


62 


1,412 


1.1 


124 


1 


6 


24 


628 


1.0 


73 


2 


18 


76 


1,922 


0.9 


124 


2 


11 


46 


1,384 


0.8 


83 


2 


14 


67 


2,128 


0.7 


82 


1 


8 


62 


1,221 


0.7 


68 


3 


5 


25 


769 


0.7 


60 


1 


1 


3 


160 


0.6 


47 


3 


6 


29 


997 


0.6 


64 


3 


12 


61 


2,002 


0.6 


76 


2 


7 


34 


1,180 


0.6 


66 


3 


5 


23 


906 


0.6 


63 


3 


7 


27 


1,312 


0.5 


69 


1 


8 


51 


1,571 


0.5 


74 


1 


10 


57 


2,132 


0.5 


86 


2 


7 


40 


1,556 


0.4 


82 


1 


3 


21 


735 


0.4 


72 


3 


8 


48 


2,166 


0.4 


124 


1 


4 


46 


1,284 


0.3 


127 


OVERALL 


199 


947 


28,204 


0.7 





Exploratory analyses were also conducted on other combinations of inter- 
est, such as on detection rate for single readers with CAD and for second 
readers in a dual-reader pair. The two presented are chosen because they 
show how the z-matrix can help to identify similar groups of readers. 

5.2. Exploratory analysis with similarity matrix. For detection rate the 
point estimates Ui in Table 1 show a group of three individuals with much 
higher detection rates than the others. Although these readers saw a rela- 
tively small number of cases, and the numbers detected are not larger than 
other readers, the z-matrix [Supplementary Table 1, Brentnall et al. (2011)] 
suggests the differences are not due to chance. The z-matrix has a block 
structure with one block corresponding to the 3 outlying readers, and the 
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Table 2 

Number of cases detected and recalled by 18 computer-assisted readers. Three cancer cases 
from center 2 had a missing reader identifier. Analysis is carried out on recall rate Ui 







Recalls 


Screens 


MLE (%) 


Concentration 


Center 


Cancers 






U-i = V-i-i- / Tt-1 


{zii X 1,000) 


2 


11 


57 


953 


6.0 


170 


2 


11 


64 


1,080 


5.9 


170 


2 


14 


59 


1,012 


5.8 


160 


2 


7 


61 


1,062 


5.7 


156 


2 


9 


69 


1,257 


5.5 


156 


2 


9 


49 


921 


5.3 


143 


1 


16 


113 


2,408 


4.7 


257 


2 


8 


46 


993 


4.6 


168 


2 


5 


46 


1,037 


4.4 


183 


1 


12 


87 


2,150 


4.0 


322 


2 


5 


36 


1,037 


3.5 


172 


1 


11 


76 


2,266 


3.4 


249 


3 


9 


61 


2,045 


3.0 


171 


1 


17 


79 


2,713 


2.9 


180 


3 


7 


27 


953 


2.8 


141 


3 


25 


84 


3,089 


2.7 


188 


3 


9 


48 


1,835 


2.6 


183 


3 


10 


35 


1,390 


2.5 


192 


2 (Unk.) 


3 


3 


3 






OVERALL 


198 


1,097 


28,204 


3.9 





other block to everyone else. This can also be seen in Figure 1, which con- 
tains plots introduced in Section 3.1. The charts suggest that the results 
might be too extreme to be due to random variation, and that there are two 
groups. Further evidence of this is seen in the zu measures of concentration 
for first dual reader detection rate, shown in Table 1: are all low. 

For single-readers with CAD the z-matrix based on data in Table 2 is 
shown in Table 3. The number of digits reading down each column gives 
an impression of the size of each Zij for j = 1, . . . , 18, and the table shows 
a center effect where the clearest difference is between centers 2 and 3, with 
readers from center 1 straddling the two. 

5.3. Model. The exploratory analysis suggests that a continuous model 
of p{u) might be inappropriate because there appear to be clusters, or at least 
there is not enough information to separate individuals within the clusters. 
Therefore, a more plausible approach than a continuous distribution is to 
take a discrete distribution for u, with unknown locations Uj and masses 9j 
for J = 1, . . . , fc, where k <n. That is, P{u = uj; 6) = 6j. The nonparamet- 
ric maximum likelihood (NPML) estimate of p{u) is a discrete distribution 
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log{u) 

(b) 

Fig. 1. Exploratory plots for detection rate, first dual reader. On the x-axis are Ui on 
a log^Q scale. The y-axis in (a) is Zk as defined in Section 3.1, on (b) it is z+j, which 
corresponds to the jump sizes in (a) . 



and has the benefit of not requiring specification of the form of p{u). An 
expectation-maximization (EM) algorithm is used to next obtain the NPML 
estimates [Laird (1978)], and a hkehhood-ratio test is used to compare the 
model fit against a null model with a single atom. The p-values presented 
follow Self and Liang (1987), and are used as a way to show the evidence 
for the fitted model, rather than to formally control type I error. This is 
relevant because the test is post hoc based on exploratory analysis, so there 
is an element of multiple testing. 

5.4. Results. For the first reader, the EM-algorithm fit has just two 
atoms at (0.0066, 0.0855) with respective masses (0.891, 0.109) and log- 
likelihood — 1,170.151. This compares against a null model with a single 
point 0.0071 and log-likelihood —1,184.125. A likelihood-ratio test to com- 
pare the models rejects the hypothesis of no difference, with p- value < 0.001. 
The estimation results for p{u) corroborate the exploratory analysis: the first 
location (0.0071) is for the majority of readers (the mass is 0.891); the sec- 
ond location (0.0855) is for the top 3 readers in Table 1 with much higher 
detection rates. 

The model fit for recall rate by readers using CAD also confirms the 
exploratory analysis. There are two points at (0.0293, 0.0507) with re- 
spective mass (0.449 0.551) and log-likelihood —4,606.186. The degener- 
ate fit is 0.0389 and has log-likelihood —4,637.097, so a likelihood-ratio 

value < 0.001. 



Table 3 

(z^ X 1,000) -matrix for reader recall rates using CAD. Within center the individuals are ordered ascending by Ui. Note that the z-matrix 
IS transposed in all the tables in this article, and so, for example, the diagonal is the largest value down each column, not row 
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5.5. Interpretation. The unusual group of three readers' detection rates, 
within the same center, can be explained by job title: they were the only ra- 
diographers in that center. However, it is unlikely that radiographers are as- 
signed more cancer cases than radiologists because the outcome is unknown 
prior to the screening. It seems more likely that center 2 used a post-event 
method of deciding who to call the first reader. This is discussed further in 
the next section. 

Single readers with CAD were found overall to have higher recall rates 
(3.9%) than dual reading (3.4%). Table 2 shows that most of the readers 
with higher recall rates were in center 2. The z-matrix in Table 3 and the 
model estimation results point toward a difference that is linked to center 2. 
That is, the slight overall increase in recall rate of single reading with CAD 
over dual reading might have been caused by a policy difference, or difference 
in case-mix at one of the centers rather than errant individual readers. 

6. Categorical dual-reader outcomes. The analysis in the previous sec- 
tion focused on binary outcomes. One of the advantages of the similarity 
matrix for exploratory analysis is that it can be readily applied to any like- 
lihood model p(y|u). In this section we show an exploratory analysis of 
dual-reader performance when 6 categorical outcomes are considered and 
the likelihood of multinomial form. 

6.1. Data. In this analysis each screenee belongs to a state Sim, where 
/ = 1, 2, respectively, denote a decision to recall or not by a reader from the 
dual-reading regimen; m = 1 for cancer present, m = 2 for cancer absent 
and m = 3 for cancer unknown. Thus, P{yik = Sim) = uum for each case 
k = 1, . . . ,ni seen by reader i. 

The different states arise because even if a reader does not flag (or does 
flag) a case for recall, they may (or, respectively, may not) be recalled in 
the trial. More specifically, when a case was fiagged for recall and it was 
recalled for further tests it is known whether there was a cancer. When it 
was not recalled by the dual readers or the single reader with CAD we call it 
"unknown" because no further tests are undertaken, but the vast majority 
of such cases will not have a cancer present. Cases that are flagged for recall 
but are unknown were not recalled after arbitration, or were not flagged 
for recall by the single reader with CAD. Cases that were not flagged for 
recall but the outcome is known might have been recalled after arbitration, 
or recalled by the single reader with CAD. 

The data are found in Supplementary Table 2. All dual readers are in- 
cluded, so we ignore whether the reader was marked as a first or second 
reader. Some of the second reader identifiers at center 2 were missing, but 
all data were available for the other centers. 
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Table 4 

(z^ X l,QQQ)-matrix for dual-reader categorical outcomes within center 1. The data are in 

Supplementary Table 2 



Experience: 5 14 0.5 4 6 12 15 18 

993 3 



997 


13 








1 




3 


973 






1 


4 


2 




8 


579 


153 


156 


356 


20 






50 


655 


94 


117 


22 






29 


59 


666 


56 


39 




3 


341 


133 


82 


465 


24 
892 



6.2. Exploratory analysis. Inspection of a scatter-matrix plot of itijk val- 
ues for the states Sim and reader experience, over readers i = l, . . . , 27, does 
not show any clear trends (Supplementary Figure 1) apart from a possible 
difference between the centers. However, a pattern is present in Sis vs. expe- 
rience but it is masked somewhat by between-center differences and sampling 
variation. The following exploration of the z-matrix in conjunction with the 
data helped to determine if there were any systematic differences between 
the readers, and to identify and show more clearly the correlation between 
experience and 5i3. 

The difference between the centers is backed up by the overall z-matrix 
(Supplementary Table 3): it has a block structure by center. Separate z- 
matrices were produced to further investigate possible differences between 
readers within each center. The z-matrix for center 1 is in Table 4. It shows 
that the readers with 0.5, 5 and 14 years experience appeared to be different 
from the other readers. It can be seen from the z-matrix in Table 5 that 
readers in center 2 were harder to tell apart, but there were possibly two 
distinct groups. However, these did not appear to be correlated with reader 
experience. In passing, we note that little attention should be paid to the 
reader i with 0.5 years experience because z+i — za = and so their pa- 
rameter fit is incompatible with all other readers' data; but Zij > for all j 
with Zjj = 0.489 and so i's data are not incompatible with the other readers' 
parameters. This asymmetry occurs because they read relatively few mam- 
mograms (Supplementary Table 2). Table 6 shows the z-matrix for center 3. 
This is quite different to the other centers because the concentration mea- 
sures Zii are high for all except one reader. Since the readers saw a similar 
number of screens to the other centers, a systematic effect is likely to be 
present within the center. Further inspection of experience against the Uj's 
within center 3 showed a potential link between Sis and reader experience. 
To obtain further understanding, the categorical response was dichotomized 
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Table 5 

(z^ X IjQQQ) -matrix for dual-reader categorical outcomes within center 2. The data are in 

Supplementary Table 2 
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into 5i3 against the rest, and a z-matrix for centers 1 and 3 was obtained (all 
readers in center 2 had S'la = 0), as shown in Supplementary Table 4. The 
ordering of individuals by their estimate Ui appears to relate to experience 
shown in the second row of the table, and the matrix pattern is inconsis- 
tent with a null hypothesis where everyone has the same Ui (Figure 3). The 
correlation to experience is most clearly displayed in Figures 2(b), (c). 

6.3. Interpretation. The readers in each center worked independently, 
and made their recall decisions on their own. However, in center 2 the ar- 
bitration process involved discussion between several readers, once a dis- 
agreement was found between the first and second reader. This might be 
why readers in center 2 were recorded as first or second reader after the 
outcomes had been observed, and why the data (Supplementary Table 2) 



Table 6 

(z^ X l,QQQ)-matrix for dual-reader categorical outcomes within center 3. The data are in 

Supplementary Table 2 
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Fig. 2. Exploratory plots for recall-but-overruled rate vs. reader experience (years), based 
on data from Supplementary Table 2 (S13 vs. the rest). The data show some evidence that 
dual readers with less experience are more likely to be overruled in centers 1 and 3. Readers 
from center 2 are excluded from the plots because that center did not record when a reader 
flagged a case for recall that was not recalled (all S\2. =0^. Plot (a) shows experience 
against the original estimates u, with center number as the symbol. Plot (b) uses expected 
experience on the y-axis, following the approach in Section 3.5. The plots are presented 
with experience on the y-axis because they show a quantity for the expected experience 
given u. Plot (c) replaces the original estimate u by a prediction from equation (3). Each 
dashed line ( — ) is a loess smoother fit. Plot (d) shows how expected experience relates to 
the original data. 



show that when a reader in center 2 flagged a case for recall, the case was 
always recalled regardless of the other reader. In any case, it is clear that, as 
originally recorded, it is difficult to compare readers from center 2 with the 
others in this analysis, and so it is reasonable to leave them out of Figure 2. 

The statistical structure shown by the z-matrix exploration and in Fig- 
ure 2, where the less-experienced readers tended to be overruled more often, 
fits with a training effect. It is common practice in dual reading to pair 
experienced readers with less experienced ones. Thus, the increased rate of 
overruled recall flags (by 3 different readers: the other, generally more expe- 
rienced dual reader, an arbitrator and the independent reader with CAD) 
might be linked with less experienced readers being more cautious in their 
recall decision. Overall, dual reading mitigates this by pairing inexperienced 
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Fig. 3. Graphical testing of the z-matrix printed out m Supplementary Table 4- The 
top-left graphic shows the z-matrix (not transposed as in Supplementary Table 4) where 
each cell is represented by a rectangle with area proportional to Zij . The rows are therefore 
histograms with the same total area for each row. The other three graphics are simulated 
z-matrices using the overall u, obtained by pooling all the data. The same number of 
screens (ui) were simulated for each reader i as in the data, and the matrices were ordered 
descending by simulated in, for consistency with Supplementary Table 4- This graphical 
test suggests that there is some evidence to reject a null hypothesis that all readers have 
the same recall-but-overruled rate. 

readers with experienced ones who are able to overrule unnecessary recalls. 
It is unclear whether single reading with CAD would similarly mitigate this 
because, although the average experience of readers using CAD in CADET 
II was similar to dual reading, the minimum experience was 4 years (com- 
pare Figure 2). Thus, in any implementation of screening based on a single 
reader with CAD, it might be worth monitoring recall rates for readers with 
less than 4 years experience. 

7. Dual reading vs. CAD false recall. So far we have considered analysis 
of the two screening regimens separately. Further modeling may be used 
to look at them together. We end by investigating the difference in recall 
rate between single readers with CAD (3.9%) and dual reading (3.4%). To 
show the technique from a different angle, we proceed as if we did not know 
about the z-matrix, and first fit a statistical model to the data. Then, the 
2;-matrix will be used to help provide more understanding of what the model 
has found. 

7.1. Data. Let yik = 1 if CAD reader i recalls case k = 1, . . . ,nj where 
no cancer is detected on recall but dual reading does not, and Uik = if 
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Table 7 

Number of noncancers recalled by CAD reader (yi+) when dual 
readers did not recall, for all cases recalled m error by either 
CAD readers or dual readers (but not both) 
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CAD reader i does not recall the case where no cancer is detected on recall 
but dual reading does. Note that the comparison to be made is between the 
cases where single readers with CAD or dual readers flag for recall in error 
(but not both of them). The data from the trial are shown in Table 7: if 
Ui < 0.5, then the CAD reader did better than dual readers, and if Ui > 0.5, 
then they did worse. 

7.2. Model. Consider a model 

logit{P(yifc = l\:>^^,Vi■, fi, a^)} = Xik(3' + Vi, 

where (3 = (/3o, /3i, /32) are parameters and Xj = Xi2, a^is) are covariates; 
Vi is a random effect taken (for convenience) to be from a Normal distribution 
with mean and variance c^; and logit(-) denotes the logistic function. 
The covariates are a constant {xn = 1), a factor for center 2 (xi2 = 1 for 
center 2, otherwise) and a factor for reader experience (^,3), whose form 
is explored below. Thus, the baseline is for centers 1 and 3 and readers 
with the reference reader experience. Other covariates [about the screen: 
first ever screen (incident) or not (prevalent), age, a score from the CAD 
algorithm predicting the likelihood of cancer; and about the reader: training 
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(radiographer, radiologist, other)] were explored but did not significantly 
improve the model fit. 

Maximum-likelihood estimation (the routine xtlogit in the computer 
software STATA that uses Gauss-Hermite quadrature for the likelihood) is 
used to find odds ratios and Wald 95% confidence bounds on the effects. The 
first definition of reader experience is a binary variable Xjs = 1 when reader i 
has more than six years experience, otherwise. This definition was chosen 
because it roughly balances the readers by center, as seen in Table 7. The 
estimated odds ratios for center 2 and reader experience effects are, respec- 
tively, 1.542.OI2.6I and 1.231-592.06) where we use the useful notation from 
Louis and Zeger (2009) to present the point estimate surrounded by a 95% 
confidence interval. Using this definition of experience seems to account for 
most between-reader variation because ln(o"^) = —13.6(43 q) [again following 
Louis and Zeger (2009) to put the standard error as a subscript]. Indeed, 
identical odds ratios are found from a straight logistic regression without Vi . 
Other definitions of reader experience suggest that a linear relationship is 
not a good one: if years of experience are used, then the odds ratio esti- 
mate is i.ool-02i.o5, and the random-effect term becomes more important 
with ln((7^) = 0.15(0.01)- Another possibility is to use log (experience), which 
resulted in an estimated reader experience odds ratio of i.o4l-33i.7o- 

The model fits provide some evidence that, perhaps surprisingly, the less 
experienced readers were less likely to recall in error with CAD than the 
experienced ones. This is different to the trend seen in Astley et al. (2006), 
although that was a retrospective study. Taken together with the results in 
Section 6.3, this might be interesting because it suggests that CAD might 
help the less experienced readers (<7 years) avoid unnecessary recall deci- 
sions. However, given that n = 18, one might be interested in understanding 
more about the data's structure, especially given the change in effect size 
depending on reader experience definition. We will proceed to further inves- 
tigate using the z-matrix and some of the plots previously used. 

7.3. z-matrix analysis. The 2:-matrix is shown in Supplementary Table 5. 
The data driving the experience effect from the model are that two readers 
with 6 and 4 years experience have relatively low Uj, with their Zij close to 
the other's zji, and they are relatively concentrated; and one reader with 17 
years experience has the highest Uj, which is also more concentrated than 
those in between. A center structure can also be observed: center 2 against 
the others. The experience pattern is also seen in Figure 4, where the three 
readers are clear in plots (b) and (c). However, some caution in interpreting 
the reader experience correlation is required: differences are seen between 
the null and observed z-matrices in Supplementary Figure 2, but the pattern 
of two low Ui and a single high Ui might be due to chance. The plot casts 
doubt on whether the pattern is real, or whether it was (mis)fortune that 
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Fig. 4. Exploratory plots for single reader with CAD recall-in- error rate (relative to 
double reading) vs. reader experience (years). Plot (a) shows experience against the original 
estimates u, with center number as the symbol. Plot (b) uses expected experience on the 
y-axis, following the approach in Section 3.5. Plot (c) replaces the original estimate u 
by a prediction from equation (3). Plot (d) shows how expected experience relates to the 
original data. 



led to the reader experience effect. A z-matrix examination therefore showed 
that the correlation between reader experience and Ui was driven by 3/18 
readers with behaviors in opposite directions, but also showed that it is quite 
a weak finding. 

7.4. Other techniques. The model fit may be explored in other ways. 
We end by using prediction to show that the center 2 effect is a more ro- 
bust finding. If a noncancer case is not recalled by the single reader with 
CAD, then, using centers 1 and 3, we fit a logistic-regression model for the 
probability of recall by dual reading with covariates for incidence/prevalence 
(first or subsequent screen) and whether the case was arbitrated. A predic- 
tion from this model is that 156 such cases could be expected at center 2. 
This compares against an observed number of 130, so the dual readers did 
slightly better than might be expected. A similar logistic-regression model 
was fitted to centers 1 and 3 for recall by the single reader with CAD, given 
the case was not a cancer and was not recalled by dual reading. A covariate 
for incidence/prevalence status was used together with a continuous variable 
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correlated to the probability of cancer according to the computer tool. This 
model predicted a total of 140 such cases, but 267 were observed. 

8. Application of the exploratory approach to other data. The z-matrix 
applies quite generally to the two-stage statistical setup described in Sec- 
tion 2. A similar data structure is found in other applications, such as the 
effect of physical tasks of patients, blood glucose levels and rat body weights 
that are in Crowder and Hand (1990); as well as many others including 
sport where individuals have repeated attempts to, for example, hit a ball 
in cricket, or score a goal in football; or in the workforce when productivity is 
measured by number of items processed by the worker. Thus, the technique 
might be used for growth curves, point processes or any other data structure 
where it is possible to write down a likelihood function for the individual. 

One aim is to use the data to find structure among the units that would 
be seen again in future samples. A common approach to this problem is 
to fit a two-stage model, which, as seen in the above data analysis, might 
produce similar findings to the z-matrix approach. However, some strengths 
of the z-matrix as an exploratory technique, relative to use of full statistical 
models, include the following: 

• As seen in Section 2.2, Zij is a comparative measure that has a direct inter- 
pretation in terms of how close individual j's parameter fit is to individ- 
ual i's data. Although other approaches can be used to estimate p(uj|y), 
they lose the direct comparative aspect that arises in the z-approach from 
restricting the u support to only contain U„ = (ui, . . . , u„). For example, 
when using NPML in Section 5 an equivalent "zjj" would have j = 1,2 
because there are two support points. 

• Plots such as Figure 1 show that the z-matrix can be used to provide 
an indication of how many distinct groups there might be; NPML simply 
gives the most likely number. For exploratory analysis both are useful. 

• The measure can be interpreted in a similar manner for different p(y\u) 
likelihoods, and the information from Uj vectors is shown in the same 
two-dimensional way for any dimension of Uj. That is, the approach stan- 
dardizes comparisons between the Uj vectors for different types of response 
variables. 

• The approach is quite general and can be easily applied to different p{y\u) 
likelihoods. Although with a binomial likelihood many other approaches 
are feasible using statistical software, this will not always be the case. 
For example, the z-approach was used for prediction when the likelihood 
function was of a self-exciting point process form in Brentnall, Crowder 
and Hand (2008). 

• For prediction the approach provides a simple approximate route to BLUP's 
(best linear unbiased predictors), or posterior means, through equation (3). 
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Some evidence of the benefit of predictions formed in this way using real 
data, compared with parametric empirical Bayes predictions, is found in 
Brentnall, Crowder and Hand (2010). 
• Finally, while computationally-intensive methods may be justified for sta- 
tistical modeling, it seems much less attractive to have to wait for ex- 
ploratory analysis to run. Once the point estimates have been obtained, 
the method requires O(n^) computations for equation (1). This makes it 
most appealing for small to moderate n. 

9. Conclusion. In this work we developed a method of exploratory anal- 
ysis for applications in which repeated measurements have been recorded 
on a group of individuals. The aim of the approach is to draw attention 
to groups of similar behaviors, outliers and trends in the data. It does so 
by helping to quantify prediction uncertainty between individual point esti- 
mates through a "similarity" measure. This z-matrix used can be viewed as 
a discrete approximation to an empirical Bayes posterior distribution. The 
approach was motivated by an analysis of reader performance in CADET II. 
We showed its application to binary and multinomial response variables, and 
illustrated some identified properties of the measure using the data. One av- 
enue for future research is to extend the approach to explicitly account for 
more than two levels in the hierarchical data structure. Such an extension 
would be useful for cancer screening since readers are sampled from screening 
centers. 

Acknowledgments. We thank Professor Karen Kafadar and two anony- 
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ology and presentation of the application in this article. 

SUPPLEMENTARY MATERIAL 

Supplement to "A method for exploratory repeated-measures analysis ap- 
plied to a breast-cancer screening study" (DOI: 10.1214/11-AOAS481SUPP; 
.zip). Some additional tables and charts to accompany this paper. 
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